The original content for Neural Network and Deep Learning course is produced and developed at Algoritma and is used as the main reference for Algoritma Academy
The primary objective of this course is to provide a fun and hands-on session to help participants gain full proficiency in data visualization systems and tools.
- dplyr library:
  - select()
  - filter()
  - mutate()
  - summarise()
  - group_by() and ungroup()
- ggplotly to add interactivity to a ggplot2 object
- ggpubr
- subplot() function from plotly
- flexdashboard
- shiny library
As data grow in complexity and size, the designer often faces the difficult task of balancing overarching storytelling with specificity in the narrative. The designer must also strike a fine balance between coverage and detail under the all-too-real constraints of static graphs and plots.
Interactive visualization is a means of overcoming these constraints, and as we’ll see later, quite a successful one at that. Quoting Rebecca Barter, the author of superheat: “Interactivity allows the viewer to engage with your data in ways impossible by static graphs. With an interactive plot, viewers can zoom into areas they care about, highlight data points that are relevant to them and hide the information that isn’t.”
We’ll start by reading our data in. The data we’ll be using is the Women in Workforce data, a historical dataset about women’s earnings and employment status by specific occupation from 2013 to 2016, compiled from the Bureau of Labor Statistics and the [Census Bureau](https://www.census.gov/).
Data transformation is one of the crucial parts of preparing our interactive charts. In the past, we’ve relied on R’s base functionality for data preparation. This time, by using dplyr, we’ll learn new techniques that may greatly increase our productivity when working with R.
dplyr is developed as “a grammar of data manipulation” and works by providing a consistent set of “verbs” that help us solve the most common data manipulation challenges:
select(): For select-ing columns
## # A tibble: 2,088 x 3
## year major_category percent_female
## <dbl> <chr> <dbl>
## 1 2013 Management, Business, and Financial 23.6
## 2 2013 Management, Business, and Financial 30.3
## 3 2013 Management, Business, and Financial 43.5
## 4 2013 Management, Business, and Financial 58.7
## 5 2013 Management, Business, and Financial 41.7
## 6 2013 Management, Business, and Financial 63.5
## 7 2013 Management, Business, and Financial 33.6
## 8 2013 Management, Business, and Financial 27.5
## 9 2013 Management, Business, and Financial 53.5
## 10 2013 Management, Business, and Financial 76.9
## # … with 2,078 more rows
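The chunk that produced the output above is not shown, so here is a minimal sketch of `select()` on a small stand-in table. The `toy_workers` tibble and its values are hypothetical and only mirror the column names seen in the output; this is not the original chunk.

```r
library(dplyr)

# Hypothetical stand-in for `workers`; values are made up for illustration.
toy_workers <- tibble(
  year = c(2013, 2013, 2016, 2016),
  occupation = c("Chief executives", "Registered nurses",
                 "Chief executives", "Registered nurses"),
  major_category = c("Management, Business, and Financial",
                     "Healthcare Practitioners and Technical",
                     "Management, Business, and Financial",
                     "Healthcare Practitioners and Technical"),
  percent_female = c(25, 90, 27, 89)
)

# Keep only the listed columns, in the order they are listed.
selected <- toy_workers %>%
  select(year, major_category, percent_female)

selected
```

Columns that are not named (here `occupation`) are dropped, while the number of rows is unchanged.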
filter(): For filter-ing rows
## # A tibble: 522 x 3
## year major_category percent_female
## <dbl> <chr> <dbl>
## 1 2016 Management, Business, and Financial 23.8
## 2 2016 Management, Business, and Financial 29.5
## 3 2016 Management, Business, and Financial 46.8
## 4 2016 Management, Business, and Financial 58.7
## 5 2016 Management, Business, and Financial 44.9
## 6 2016 Management, Business, and Financial 67.3
## 7 2016 Management, Business, and Financial 39.5
## 8 2016 Management, Business, and Financial 26.8
## 9 2016 Management, Business, and Financial 52.6
## 10 2016 Management, Business, and Financial 75.1
## # … with 512 more rows
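As a hedged illustration of `filter()`, the sketch below keeps only the rows for one year. The `toy_workers` tibble is hypothetical; only the column names and the `year == 2016` condition come from the surrounding material.

```r
library(dplyr)

# Hypothetical stand-in for `workers`; values are made up for illustration.
toy_workers <- tibble(
  year = c(2013, 2014, 2016, 2016),
  major_category = c("Service", "Service", "Service", "Sales and Office"),
  percent_female = c(50, 51, 52, 60)
)

# Keep only the rows where the logical condition is TRUE.
only_2016 <- toy_workers %>%
  select(year, major_category, percent_female) %>%
  filter(year == 2016)

only_2016
```

`filter()` changes the number of rows and leaves the columns untouched, which is the mirror image of what `select()` does.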
mutate(): For manipulating columns; either modify an existing column or create a new one.
workers %>%
  select(year, major_category, percent_female) %>%
  filter(year == 2016) %>%
  mutate(percent_male = 100 - percent_female)
## # A tibble: 522 x 4
## year major_category percent_female percent_male
## <dbl> <chr> <dbl> <dbl>
## 1 2016 Management, Business, and Financial 23.8 76.2
## 2 2016 Management, Business, and Financial 29.5 70.5
## 3 2016 Management, Business, and Financial 46.8 53.2
## 4 2016 Management, Business, and Financial 58.7 41.3
## 5 2016 Management, Business, and Financial 44.9 55.1
## 6 2016 Management, Business, and Financial 67.3 32.7
## 7 2016 Management, Business, and Financial 39.5 60.5
## 8 2016 Management, Business, and Financial 26.8 73.2
## 9 2016 Management, Business, and Financial 52.6 47.4
## 10 2016 Management, Business, and Financial 75.1 24.9
## # … with 512 more rows
group_by(): For setting groups in our data
summarise(): For taking a summary of our data
ungroup(): For unsetting groups
Without group_by(), summarise() computes the summary over all existing rows of the specified numerical columns:
workers %>%
  # select(year, major_category, percent_female) %>%
  filter(year == 2016) %>%
  mutate(percent_male = 100 - percent_female) %>%
  summarise(
    percent_female = mean(percent_female),
    percent_male = mean(percent_male)
  )
## # A tibble: 1 x 2
## percent_female percent_male
## <dbl> <dbl>
## 1 36.3 63.7
By adding group_by(), summarise() will give summaries grouped by categorical column:
workers %>%
  # select(year, major_category, percent_female) %>%
  filter(year == 2016) %>%
  mutate(percent_male = 100 - percent_female) %>%
  group_by(major_category) %>%
  summarise(percent_male = mean(percent_male),
            percent_female = mean(percent_female)) %>%
  ungroup()
## # A tibble: 8 x 3
## major_category percent_male percent_female
## <chr> <dbl> <dbl>
## 1 Computer, Engineering, and Science 72.3 27.7
## 2 Education, Legal, Community Service, Arts, and Me… 44.6 55.4
## 3 Healthcare Practitioners and Technical 34.8 65.2
## 4 Management, Business, and Financial 53.3 46.7
## 5 Natural Resources, Construction, and Maintenance 94.2 5.78
## 6 Production, Transportation, and Material Moving 77.9 22.1
## 7 Sales and Office 41.3 58.7
## 8 Service 53.7 46.3
Extra Notes: Why you should use ungroup() after every group_by()
group_by() adds metadata to a data.frame that marks how rows should be grouped. As long as that metadata is there, every transformation you do after the grouping will involve all the grouping columns.
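The leftover grouping metadata can be inspected directly. The sketch below uses a hypothetical `toy` tibble and assumes dplyr's default behaviour, in which `summarise()` drops only the last grouping level; `group_vars()` reports which grouping columns are still attached.

```r
library(dplyr)

# Hypothetical data; only the column names echo the course dataset.
toy <- tibble(
  major_category = c("A", "A", "B", "B"),
  minor_category = c("a1", "a2", "b1", "b1"),
  percent_male = c(40, 60, 70, 90)
)

# After summarise(), the result is still grouped by the first column.
still_grouped <- toy %>%
  group_by(major_category, minor_category) %>%
  summarise(percent_male = mean(percent_male))
group_vars(still_grouped)

# ungroup() removes the remaining grouping metadata.
no_groups <- still_grouped %>% ungroup()
group_vars(no_groups)

# The same mutate() call now means different things:
# n() counts rows per leftover group vs. rows in the whole table.
per_group <- still_grouped %>% mutate(n_rows = n()) %>% pull(n_rows)
overall <- no_groups %>% mutate(n_rows = n()) %>% pull(n_rows)
```

Here `per_group` counts rows within each `major_category`, while `overall` counts all rows, which is the kind of silent difference that an explicit `ungroup()` is meant to prevent.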
See the following example:
# Avoid potential unintended errors due to the grouping.
# The following fails because mutate() is trying to
# change one of the columns used by group_by(),
# and it can see that because of the metadata
# passed through by dplyr::summarize.
# workers %>%
#   group_by(major_category, minor_category) %>%
#   summarise(percent_male = mean(percent_male),
#             percent_female = mean(percent_female)) %>%
#   # ungroup() %>% # ungroup removes any grouping metadata
#   mutate(minor_category = reorder(minor_category, percent_male))
arrange(): For arranging our rows based on a column value
workers %>%
  # select(year, major_category, percent_female) %>%
  filter(year == 2016) %>%
  mutate(percent_male = 100 - percent_female) %>%
  group_by(major_category) %>%
  summarise(percent_male = mean(percent_male),
            percent_female = mean(percent_female)) %>%
  ungroup() %>%
  arrange(desc(percent_female))
## # A tibble: 8 x 3
## major_category percent_male percent_female
## <chr> <dbl> <dbl>
## 1 Healthcare Practitioners and Technical 34.8 65.2
## 2 Sales and Office 41.3 58.7
## 3 Education, Legal, Community Service, Arts, and Me… 44.6 55.4
## 4 Management, Business, and Financial 53.3 46.7
## 5 Service 53.7 46.3
## 6 Computer, Engineering, and Science 72.3 27.7
## 7 Production, Transportation, and Material Moving 77.9 22.1
## 8 Natural Resources, Construction, and Maintenance 94.2 5.78
drop_na(): For dropping rows that contain NA in the specified column(s)
## year occupation major_category
## 0 0 0
## minor_category total_workers workers_male
## 0 0 0
## workers_female percent_female total_earnings
## 0 0 0
## total_earnings_male total_earnings_female wage_percent_of_male
## 4 65 846
# drop NA rows with `drop_na()`
workers <- workers %>%
  drop_na(total_earnings_male, total_earnings_female)
# check NA values
colSums(is.na(workers))
## year occupation major_category
## 0 0 0
## minor_category total_workers workers_male
## 0 0 0
## workers_female percent_female total_earnings
## 0 0 0
## total_earnings_male total_earnings_female wage_percent_of_male
## 0 0 777
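The output above shows that `wage_percent_of_male` still contains NA values after the call, because `drop_na()` only looks at the columns it is given. A minimal sketch of that behaviour, on a hypothetical `toy` tibble with made-up values:

```r
library(dplyr)
library(tidyr)

# Hypothetical data; column names follow the course dataset.
toy <- tibble(
  occupation = c("a", "b", "c", "d"),
  total_earnings_male = c(50000, NA, 42000, 61000),
  total_earnings_female = c(45000, 38000, NA, 60000),
  wage_percent_of_male = c(90, NA, NA, NA)
)

# Count NA values per column before cleaning.
colSums(is.na(toy))

# Drop rows with NA in either earnings column only.
cleaned <- toy %>%
  drop_na(total_earnings_male, total_earnings_female)

# NA values in columns that were not listed are left in place.
colSums(is.na(cleaned))
```

Calling `drop_na()` with no columns would instead drop every row that has an NA anywhere, which on this toy table would leave a single row.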
n(): For counting the number of rows, either across all rows or by group
## # A tibble: 1 x 1
## n_total
## <int>
## 1 2019
# count number of observations based on grouping category
workers %>%
  group_by(major_category) %>%
  summarise(n_total = n()) %>%
  ungroup()
## # A tibble: 8 x 2
## major_category n_total
## <chr> <int>
## 1 Computer, Engineering, and Science 235
## 2 Education, Legal, Community Service, Arts, and Media 168
## 3 Healthcare Practitioners and Technical 124
## 4 Management, Business, and Financial 232
## 5 Natural Resources, Construction, and Maintenance 282
## 6 Production, Transportation, and Material Moving 429
## 7 Sales and Office 280
## 8 Service 269
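The ungrouped count shown earlier has no visible chunk, so the sketch below puts both uses of `n()` side by side on a hypothetical `toy` tibble. It also shows `count()`, a dplyr shorthand for the group, summarise, and ungroup sequence; using it here is a suggestion, not part of the original material.

```r
library(dplyr)

# Hypothetical data for illustration.
toy <- tibble(
  major_category = c("Service", "Service", "Sales and Office", "Service"),
  occupation = c("w", "x", "y", "z")
)

# Without grouping, n() counts every row in the table.
total <- toy %>%
  summarise(n_total = n())

# With grouping, n() counts the rows inside each group.
by_category <- toy %>%
  group_by(major_category) %>%
  summarise(n_total = n()) %>%
  ungroup()

# count() gives the same per-group counts in one call.
by_count <- toy %>%
  count(major_category, name = "n_total")
```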
pivot_longer() / pivot_wider(): For reshaping data
# use `pivot_longer` to convert wide to long
df <- workers %>%
  filter(year == 2016) %>%
  mutate(percent_male = 100 - percent_female) %>%
  group_by(major_category) %>%
  summarise(percent_male = mean(percent_male),
            percent_female = mean(percent_female)) %>%
  ungroup() %>%
  arrange(desc(percent_female))
df %>%
  pivot_longer(cols = -major_category)
## # A tibble: 16 x 3
## major_category name value
## <chr> <chr> <dbl>
## 1 Healthcare Practitioners and Technical percent_male 35.9
## 2 Healthcare Practitioners and Technical percent_female 64.1
## 3 Sales and Office percent_male 41.3
## 4 Sales and Office percent_female 58.7
## 5 Education, Legal, Community Service, Arts, and Media percent_male 44.6
## 6 Education, Legal, Community Service, Arts, and Media percent_female 55.4
## 7 Service percent_male 53.0
## 8 Service percent_female 47.0
## 9 Management, Business, and Financial percent_male 53.3
## 10 Management, Business, and Financial percent_female 46.7
## 11 Computer, Engineering, and Science percent_male 72.0
## 12 Computer, Engineering, and Science percent_female 28.0
## 13 Production, Transportation, and Material Moving percent_male 77.0
## 14 Production, Transportation, and Material Moving percent_female 23.0
## 15 Natural Resources, Construction, and Maintenance percent_male 93.5
## 16 Natural Resources, Construction, and Maintenance percent_female 6.50
# example on how to use `pivot_wider`
df_long <- df %>%
  pivot_longer(cols = -major_category)
df_long %>%
  pivot_wider(names_from = name, values_from = value)
## # A tibble: 8 x 3
## major_category percent_male percent_female
## <chr> <dbl> <dbl>
## 1 Healthcare Practitioners and Technical 35.9 64.1
## 2 Sales and Office 41.3 58.7
## 3 Education, Legal, Community Service, Arts, and Me… 44.6 55.4
## 4 Service 53.0 47.0
## 5 Management, Business, and Financial 53.3 46.7
## 6 Computer, Engineering, and Science 72.0 28.0
## 7 Production, Transportation, and Material Moving 77.0 23.0
## 8 Natural Resources, Construction, and Maintenance 93.5 6.50
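The chunks above rely on the default output column names `name` and `value`. As a small sketch on a hypothetical `wide` tibble, `names_to` and `values_to` can give those columns more descriptive names, and `pivot_wider()` then reverses the reshape. The names `gender` and `percent` are illustrative choices, not part of the course material.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per category, one column per gender.
wide <- tibble(
  major_category = c("Service", "Sales and Office"),
  percent_male = c(53, 41),
  percent_female = c(47, 59)
)

# Wide to long: one row per category/gender pair.
long <- wide %>%
  pivot_longer(
    cols = -major_category,
    names_to = "gender",
    values_to = "percent"
  )

# Long back to wide: the original layout is recovered.
back <- long %>%
  pivot_wider(names_from = gender, values_from = percent)
```

The long layout is the one ggplot2 generally expects when a variable such as gender is mapped to `fill` or `colour`.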
Read in data/youtubetrends.csv and save it as vids, then follow these instructions:
- In the vids dataframe, create two new columns: likesperview, which stores the ratio of likes/view, and dislikesperview, which stores the ratio of dislikes/view.
- summarise() is compatible with almost all the functions in R. By using the n() function, count the total number of trending videos (1 row = 1 video) in each channel. Take only the channels that have at least 10 trending videos and save the result as vids_top.
- Reshape vids_top as a long-format dataframe using pivot_longer().
ggplot + plotly
To wrap up the process we performed earlier, in the following chunk we’ll start off by re-reading, tidying, and transforming the data:
# read data
workers <- read_csv("data/jobs_gender.csv")
# read theme from RDS
theme_algoritma <- readRDS('assets/theme_algoritma.rds')
# tidy data
workers <- workers %>%
  mutate(percent_male = 100 - percent_female) %>%
  drop_na(total_earnings_male, total_earnings_female)
I’ve also copy-pasted the earlier transformation process and saved the result as the workers_gap dataframe. Using this data, we’ll visualize the gender gap between male and female workers in 2016.
# transform data
workers_gap <- workers %>%
  filter(year == 2016) %>%
  group_by(major_category) %>%
  summarise(Male = mean(percent_male),
            Female = mean(percent_female)) %>%
  ungroup() %>%
  mutate(
    major_category = reorder(major_category,
                             Male - Female)
  ) %>%
  pivot_longer(cols = -major_category) %>%
  mutate(
    text = glue::glue('{name}: {round(value,2)}%')
  )
# visualize
plot <- ggplot(workers_gap, aes(x = value, y = major_category, text = text)) +
  geom_col(aes(fill = name)) +
  geom_vline(xintercept = 50, linetype = "dotted") +
  labs(x = NULL, y = NULL, title = "US Labor Force Participation in 2016") +
  theme(legend.position = "none") +
  scale_x_continuous(labels = scales::unit_format(unit = "%")) +
  theme_algoritma
# add interactivity
ggplotly(plot, tooltip = "text")
Using ggplot & plotly, recreate the plot in the assets/divedeep2.html file!
# Reference answer for previous dive deeper
workers_earn <- workers %>%
  filter(year == 2016) %>%
  group_by(major_category) %>%
  summarise(Male = mean(total_earnings_male),
            Female = mean(total_earnings_female)) %>%
  ungroup() %>%
  mutate(
    major_category = reorder(major_category,
                             Male - Female)
  ) %>%
  pivot_longer(cols = -major_category) %>%
  mutate(
    text = glue::glue('{name}: {round(value,2)}$')
  )
# visualize
plot_earn <- ggplot(workers_earn, aes(x = value, y = major_category, text = text)) +
  geom_col(aes(fill = name), position = "dodge") +
  labs(x = NULL, y = NULL, title = "US Gender Pay Gap in 2016") +
  theme(legend.position = "none") +
  scale_x_continuous(labels = scales::unit_format(unit = "$")) +
  theme_algoritma
# add interactivity
ggplotly(plot_earn, tooltip = "text")
Use ggarrange() and ggexport() from the ggpubr library to export the plots as a simple PDF file:
Easy interactive dashboards for R:
- Use R Markdown to publish a group of related data visualizations as a dashboard.
- Support for a wide variety of components including htmlwidgets; base, lattice, and grid graphics; tabular data; gauges and value boxes; and text annotations.
More about flexdashboard:
- https://rmarkdown.rstudio.com/flexdashboard/index.html
- https://rmarkdown.rstudio.com/flexdashboard/using.html#storyboards
To keep the notebook light, the shiny section is provided in a separate Rmd file. Please open shiny.Rmd for the next section.